Wrapper Generation Supervised by a Noisy Crowd
نویسندگان
چکیده
We present solutions based on crowdsourcing platforms to support large-scale production of accurate wrappers around data-intensive websites. Our approach is based on supervised wrapper induction algorithms which demand the burden of generating the training data to the workers of a crowdsourcing platform. Workers are paid for answering simple membership queries chosen by the system. We present two algorithms: a single worker algorithm (alfη) and a multiple workers algorithm (alfred). Both the algorithms deal with the inherent uncertainty of the responses and use an active learning approach to select the most informative queries. alfred estimates the workers’ error rate to decide at runtime how many workers are needed. The experiments that we conducted on real and synthetic data are encouraging: our approach is able to produce accurate wrappers at a low cost, even in presence of workers with a significant error rate.
منابع مشابه
Semi-Automatic Wrapper Generation for Commercial Web Sources
Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. But the wrapper generation techniques presented up to date are unable to properly deal with sources requiring complex navigational sequences for accessing data. In this paper, we present Wargo, a semi-automatic wrapper generation tool, which has been used by non-programmer...
متن کاملSupervised Wrapper Generation with Lixto
We illustrate basic features of the Lixto wrapper generator such as the user and system interaction, the capacious visual interface, the marking and selecting procedures, and the extraction tasks by describing the construction of a simple example program in the current Lixto prototype.
متن کاملControlling the effect of crowd noisy annotations in NLP Tasks
Natural Language Processing (NLP) is a sub-field of Artificial Intelligence and Linguistics, with the aim of studying problems in the automatic generation and understanding of natural language. It involves identifying and exploiting linguistic rules and variation with code to translate unstructured language data into information with a schema. Empirical methods in NLP employ machine learning te...
متن کاملDEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web
The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present DEXTER, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our ...
متن کاملA Supervised Visual Wrapper Generator for Web-Data Extraction
Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper, we propose a novel schema-guided approach to wrapper generation. We provide a user-friendly interface that allows users to define the schema of the data to be extracted, and specifies mappings from a HTML page to the target schema. Based on...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013